Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Start poller after fast sync is completed #54

Closed
wants to merge 7 commits into from

Conversation

bap2pecs
Copy link
Contributor

@bap2pecs bap2pecs commented Sep 15, 2024

Summary

this is to fix #53

it also makes sense b/c fast sync will fetch the blocks anyway. so it's a waste of resources as well if there is another poller routine doing the same thing

Test Plan

make test-e2e-op
make test
make test-e2e-babylon --run ^TestFastSync$

tested this on GCP, waiting for the FP to run out of BBN. Then, after restarting FP, the sigs submission will work fine. https://gist.github.com/lesterli/830f6ec7cac7c8a9d43b251039d20c8b

@@ -97,6 +97,10 @@ func (cp *ChainPoller) Start(startHeight uint64) error {
}

func (cp *ChainPoller) Stop() error {
if !cp.IsRunning() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

without this, TestOpFpNoVotingPower will fail b/c it's trying to stop a poller that's not started

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this the same as !cp.isStarted.Swap(false)? It's just nil is returned instead of an error

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, now TestOpFpNoVotingPower will start the poller, it can be removed.

}

fp.poller = poller
// Start the poller if fast sync is disabled or there's no finalized block
if (fp.cfg.FastSyncInterval == 0 || lastFinalizedBlock == nil) && !fp.poller.IsRunning() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

without lastFinalizedBlock == nil, babylon e2e TestFastSync will fail b/c tm.WaitForFpVoteCast(t, fpIns) will timeout

this is b/c in the code below, tryFastSync will not do anything

image

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move !fp.poller.IsRunning() out to a new if condition? imo if !fp.poller.IsRunning() then we need to return err? Also if this if statement is false, when will the poller start?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this if statement is false, we start the poller after fast sync is finished.

// start poller after fast sync is finished
if !fp.poller.IsRunning() {
   err := fp.poller.Start(res.LastProcessedHeight + 1)
   if err != nil {
   	fp.logger.Error("failed to start the poller", zap.Error(err))
   	fp.reportCriticalErr(err)
   }
   continue
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure we should add these check because up to here the fast sync is already finished and should be safe to start the poller. Can we have a workaround in e2e?

@bap2pecs
Copy link
Contributor Author

not sure why CI failed even on lint. make lint runs fine locally

image

@bap2pecs
Copy link
Contributor Author

also seems babylon-private access in CI is still broken

image

@SebastianElvis
Copy link
Member

Thanks for the proposed fix! @gitferry what do you think about starting poller only after fast sync? actually I remember we had a discussion on it but somehow did not push it further

Also, if this idea is approved, how about merging it to main rather than the base branch?

Copy link
Member

@gitferry gitferry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the previous implementation, the poller is started after bootstrap is finished which involves fast sync, so I don't understand what specific case does this PR want to address. Did I miss anything?

@@ -97,6 +97,10 @@ func (cp *ChainPoller) Start(startHeight uint64) error {
}

func (cp *ChainPoller) Stop() error {
if !cp.IsRunning() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this the same as !cp.isStarted.Swap(false)? It's just nil is returned instead of an error

@bap2pecs
Copy link
Contributor Author

@gitferry. ok we just took another look at the logs. so what happened was:

after topping up BBN and restarting the FP, bootstrap will only proceed to the block that has pub rand. when it detects a block that doesn't have PR, it will return without error:

func (fp *FinalityProviderInstance) hasRandomness(b *types.BlockInfo) (bool, error) {
	lastCommittedHeight, err := fp.GetLastCommittedHeight()
	if err != nil {
		return false, err
	}
	if b.Height > lastCommittedHeight {
		fp.logger.Debug(
			"the finality provider has not committed public randomness for the height",
			zap.String("pk", fp.GetBtcPkHex()),
			zap.Uint64("block_height", b.Height),
			zap.Uint64("last_committed_height", lastCommittedHeight),
		)
		return false, nil
	}

	return true, nil
}

then it will consider bootstrap complete and prints "the finality-provider has been bootstrapped"

see relevant logs

2024-09-08T13:06:49.101779Z	info	Starting finality-provider instance	{"pk": "e1241d1fa4087b853bf8f92734241e52afddd629fdb5620e91c393a907425e5e"}
2024-09-08T13:06:49.104649Z	debug	the finality-provider is entering fast sync	{"start_height": 25444, "target_block_height": 46340}
2024-09-08T13:07:53.076012Z	debug	the finality provider has not committed public randomness for the height	{"pk": "e1241d1fa4087b853bf8f92734241e52afddd629fdb5620e91c393a907425e5e", "block_height": 46334, "last_committed_height": 25500}
2024-09-08T13:07:53.090461Z	info the finality-provider has been bootstrapped {"pk": "e1241d1fa4087b853bf8f92734241e52afddd629fdb5620e91c393a907425e5e", "height": 25501}

cc @SebastianElvis

Copy link
Member

@SebastianElvis SebastianElvis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good if @gitferry can take a final look 👍

}

fp.poller = poller
// Start the poller if fast sync is disabled or there's no finalized block
if (fp.cfg.FastSyncInterval == 0 || lastFinalizedBlock == nil) && !fp.poller.IsRunning() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move !fp.poller.IsRunning() out to a new if condition? imo if !fp.poller.IsRunning() then we need to return err? Also if this if statement is false, when will the poller start?

@gitferry
Copy link
Member

@gitferry. ok we just took another look at the logs. so what happened was:

after topping up BBN and restarting the FP, bootstrap will only proceed to the block that has pub rand. when it detects a block that doesn't have PR, it will return without error:

func (fp *FinalityProviderInstance) hasRandomness(b *types.BlockInfo) (bool, error) {
	lastCommittedHeight, err := fp.GetLastCommittedHeight()
	if err != nil {
		return false, err
	}
	if b.Height > lastCommittedHeight {
		fp.logger.Debug(
			"the finality provider has not committed public randomness for the height",
			zap.String("pk", fp.GetBtcPkHex()),
			zap.Uint64("block_height", b.Height),
			zap.Uint64("last_committed_height", lastCommittedHeight),
		)
		return false, nil
	}

	return true, nil
}

then it will consider bootstrap complete and prints "the finality-provider has been bootstrapped"

see relevant logs

2024-09-08T13:06:49.101779Z	info	Starting finality-provider instance	{"pk": "e1241d1fa4087b853bf8f92734241e52afddd629fdb5620e91c393a907425e5e"}
2024-09-08T13:06:49.104649Z	debug	the finality-provider is entering fast sync	{"start_height": 25444, "target_block_height": 46340}
2024-09-08T13:07:53.076012Z	debug	the finality provider has not committed public randomness for the height	{"pk": "e1241d1fa4087b853bf8f92734241e52afddd629fdb5620e91c393a907425e5e", "block_height": 46334, "last_committed_height": 25500}
2024-09-08T13:07:53.090461Z	info the finality-provider has been bootstrapped {"pk": "e1241d1fa4087b853bf8f92734241e52afddd629fdb5620e91c393a907425e5e", "height": 25501}

cc @SebastianElvis

It seems that the issue is that fast sync reaches the latest height that pub rand is committed but still far from the tip height, so the fast sync got triggered frequently causing the poller stuck. If it is true then the issue should be why the pub rand is not caught up because we have a separate goroutine to commit pub rand continuously.

@bap2pecs
Copy link
Contributor Author

If it is true then the issue should be why the pub rand is not caught up because we have a separate goroutine to commit pub rand continuously.

@lesterli could you help take a look later today? i will be mostly in events today

@lesterli
Copy link

lesterli commented Sep 20, 2024

If it is true then the issue should be why the pub rand is not caught up because we have a separate goroutine to commit pub rand continuously.

@lesterli could you help take a look later today? i will be mostly in events today

after checking the FP codes, the defaultRandomInterval is 30s and defaultFastSyncInterval is 10s, when fast sync was running and there were many lagging blocks(25444-46340), it is possible that some blocks' pub rand had not been submitted.

There is a similar case when fp was just started:

2024-09-06T14:43:05.123360Z	info	Starting finality-provider instance	{"pk": "e1241d1fa4087b853bf8f92734241e52afddd629fdb5620e91c393a907425e5e"}
2024-09-06T14:43:05.128529Z	debug	the finality-provider is entering fast sync	{"start_height": 1, "target_block_height": 27}
2024-09-06T14:43:05.132508Z	debug	Successfully batch query blocks	{"start_height": 1, "end_height": 27, "limit": 10, "last_block_hash": "20859a1e111efde38dcdc70037c60d8eba302ab635753d7dd031c8dd180a0064"}
2024-09-06T14:43:05.139480Z	debug	the finality provider has not committed public randomness for the height	{"pk": "e1241d1fa4087b853bf8f92734241e52afddd629fdb5620e91c393a907425e5e", "block_height": 1, "last_committed_height": 0}
2024-09-06T14:43:05.169614Z	info	the finality-provider has been bootstrapped	{"pk": "e1241d1fa4087b853bf8f92734241e52afddd629fdb5620e91c393a907425e5e", "height": 1}
2024-09-06T14:43:05.169695Z	info	starting the chain poller
2024-09-06T14:43:05.173325Z	info	the chain poller is successfully started

@lesterli lesterli requested a review from gitferry September 21, 2024 02:51
@lesterli
Copy link

bootstrap is finished which involves fast sync

@gitferry in the bootstrap(), should we first check FastSyncInterval? If the fast sync is disabled, we should not trigger fast sync, right?

@bap2pecs
Copy link
Contributor Author

@gitferry so the problem is fast sync cannot fast-sync pub randomnesses. and it doesn't make sense to fast-sync pub randomnesses either b/c iirc we need to wait for some time for a pub rand to be useful.

so it seems we will still need this change

@gitferry
Copy link
Member

@gitferry in the bootstrap(), should we first check FastSyncInterval? If the fast sync is disabled, we should not trigger fast sync, right?

Yes, we have the check at the beginning of checkLaggingLoop

Comment on lines 77 to 78
cp.logger.Info("the poller is already started")
return nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why change to return nil? If so, it will hide the fact that the poller is called twice, which we do not expect.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for a mistake, this update is for the Stop function to fix the failed e2e TestFastSync.

}

fp.poller = poller
// Start the poller if fast sync is disabled or there's no finalized block
if (fp.cfg.FastSyncInterval == 0 || lastFinalizedBlock == nil) && !fp.poller.IsRunning() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure we should add these check because up to here the fast sync is already finished and should be safe to start the poller. Can we have a workaround in e2e?

Comment on lines +223 to +230
if !fp.poller.IsRunning() {
err := fp.poller.Start(res.LastProcessedHeight + 1)
if err != nil {
fp.logger.Error("failed to start the poller", zap.Error(err))
fp.reportCriticalErr(err)
}
continue
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the start of the poller should be controlled within the sig submission loop. The poller should be already started when the loop is on

@gitferry
Copy link
Member

@gitferry so the problem is fast sync cannot fast-sync pub randomnesses. and it doesn't make sense to fast-sync pub randomnesses either b/c iirc we need to wait for some time for a pub rand to be useful.

so it seems we will still need this change

we have a separate loop continuously committing pub rand so I think eventually fast-sync will go to the expected height

@bap2pecs
Copy link
Contributor Author

we have a separate loop continuously committing pub rand so I think eventually fast-sync will go to the expected height

right but the problem is bootstrap () will not wait for it. thus the edge case i.e. poller channel gets full still exists. and this PR tries to fix that

@bap2pecs
Copy link
Contributor Author

bap2pecs commented Sep 25, 2024

ok we just realized that this PR still has some problems. i.e. pubrand is lagging a lot so fast sync exits early

https://github.com/babylonlabs-io/finality-provider/compare/base/consumer-chain-support...fix/tmp-fix-poller?expand=1 this might be an easier fix to avoid all the headaches but still resolves #53

basically there are 3 ways I can think of:

  1. sth similar to this PR
  2. increase buffer size (feels this is the easiest path)
  3. block bootstrap() until fast sync of both pub rand and sig submission completed

cc @gitferry @SebastianElvis

@bap2pecs
Copy link
Contributor Author

another discussion topic is whether we should fast sync pub rand

@bap2pecs
Copy link
Contributor Author

discussed w @gitferry.

@lesterli
Copy link

created #77

@bap2pecs
Copy link
Contributor Author

closing as we have removed fast sync

@bap2pecs bap2pecs closed this Nov 26, 2024
@lesterli lesterli deleted the fix/poller-stuck branch December 3, 2024 03:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants